Featurizing Text: Converting Text into Predictors for Regression Analysis

Authors

  • Dean P. Foster
  • Mark Liberman
  • Robert A. Stine
Abstract

Modern data streams routinely combine text with the familiar numerical data used in regression analysis. For example, real estate listings that show the price of a property typically include a verbal description. Some descriptions include numerical data, such as the number of rooms or the size of the home. Many others, however, describe the property only verbally, often using an idiosyncratic vernacular. For modeling such data, we describe several methods that convert such text into numerical features suitable for regression analysis. The proposed featurizing techniques create regressors directly from text, requiring minimal user input. The techniques range from naive to subtle. One can simply use raw counts of words, obtain principal components from these counts, or build regressors from counts of adjacent words. Our example that models real estate prices illustrates the surprising success of these methods. To partially explain this success, we offer a motivating probabilistic model. Because the derived regressors are difficult to interpret, we further show how the presence of partial quantitative features extracted from text can elucidate the structure of a model.

Key Phrases: sentiment analysis, n-gram, latent semantic analysis, text mining

∗Research supported by NSF grant 1106743
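The three kinds of regressors named in the abstract (raw word counts, principal components of those counts, and counts of adjacent words) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngram_counts`, `principal_components`, and `fit_ols`, the whitespace tokenizer, and the toy listings and prices are all assumptions made for illustration, and the paper's actual construction may differ in its details.

```python
import numpy as np

def ngram_counts(docs, n=1):
    """Count n-grams per document (n=1: words, n=2: adjacent word pairs).

    Hypothetical helper: tokenizes by lowercasing and splitting on whitespace.
    Returns (vocab, W) where W[i, j] is the count of vocab[j] in docs[i].
    """
    grams_per_doc = []
    for doc in docs:
        tokens = doc.lower().split()
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        grams_per_doc.append(grams)
    vocab = sorted({g for grams in grams_per_doc for g in grams})
    index = {g: j for j, g in enumerate(vocab)}
    W = np.zeros((len(docs), len(vocab)))
    for i, grams in enumerate(grams_per_doc):
        for g in grams:
            W[i, index[g]] += 1
    return vocab, W

def principal_components(W, k):
    """Scores on the leading k principal components of the count matrix."""
    centered = W - W.mean(axis=0)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * s[:k]

def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Invented toy data, for illustration only (not from the paper).
listings = [
    "charming brick colonial with large yard",
    "cozy studio near train",
    "large brick home with pool and yard",
    "studio with updated kitchen near park",
]
log_price = np.array([12.9, 11.8, 13.2, 12.0])

_, word_W = ngram_counts(listings, n=1)     # raw word counts
_, bigram_W = ngram_counts(listings, n=2)   # counts of adjacent words
scores = principal_components(word_W, k=2)  # low-dimensional regressors
beta = fit_ols(scores, log_price)           # intercept plus two slopes
```

Regressing on a few principal component scores rather than on every word count keeps the number of predictors small relative to the number of listings, which is one plausible reason to prefer them when the vocabulary is large.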

Publication date: 2013